Dataset Curator

Installs: 45
Rank: #16430

Install

npx skills add https://github.com/eddiebe147/claude-settings --skill 'Dataset Curator'
Dataset Curator
The Dataset Curator skill guides you through the critical process of preparing high-quality training data for machine learning models. Data quality is the single most important factor in model performance, yet it is often underinvested in. This skill helps you systematically clean, validate, augment, and maintain datasets that lead to better models.
From initial collection to ongoing maintenance, this skill covers deduplication, label quality assessment, bias detection, augmentation strategies, and version control. It applies best practices from production ML systems to ensure your datasets are not just clean, but strategically optimized for your learning objectives.
Whether you are building a classifier, fine-tuning an LLM, or training a custom model, this skill ensures your data foundation is solid.
Core Workflows
Workflow 1: Assess Dataset Quality
1. Profile the dataset:
   - Size and dimensionality
   - Label distribution and balance
   - Missing value patterns
   - Feature statistics
2. Identify quality issues:
   - Duplicates (exact and near-duplicate)
   - Mislabeled examples
   - Outliers and anomalies
   - Data leakage
   - Bias and representation gaps
3. Measure quality metrics:

   ```python
   def assess_quality(dataset):
       return {
           "size": len(dataset),
           "duplicate_rate": find_duplicates(dataset).ratio,
           "missing_rate": dataset.isnull().mean(),
           "label_balance": compute_entropy(dataset.labels),
           "outlier_rate": detect_outliers(dataset).ratio,
           "estimated_label_noise": estimate_label_noise(dataset),
       }
   ```

4. Prioritize issues by impact
5. Create a remediation plan
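The profiling step above can be sketched in plain Python. This is a minimal sketch, assuming records are dicts with a `label` key and `None` marking missing values; the `profile_dataset` helper is illustrative, not part of the skill:

```python
import math
from collections import Counter

def profile_dataset(records, label_key="label"):
    """Minimal dataset profile: size, label balance, missing-value rate."""
    n = len(records)
    labels = Counter(r.get(label_key) for r in records)

    # Normalized label entropy: 1.0 = perfectly balanced, 0.0 = single class
    probs = [count / n for count in labels.values()]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    max_entropy = math.log2(len(labels)) if len(labels) > 1 else 1.0

    # Missing rate over all cells, counting None as missing
    missing = sum(1 for r in records for v in r.values() if v is None)
    n_cells = sum(len(r) for r in records)

    return {
        "size": n,
        "label_counts": dict(labels),
        "label_balance": entropy / max_entropy,
        "missing_rate": missing / n_cells,
    }
```

Running this before any cleaning gives you the baseline statistics that the later workflows compare against.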
Workflow 2: Clean and Prepare Data

1. Remove duplicates:
   - Exact duplicates: hash-based dedup
   - Near-duplicates: similarity-based clustering
   - Decide: keep first, best, or merge
2. Handle missing values:
   - Understand the missingness mechanism (MCAR, MAR, MNAR)
   - Impute, drop, or flag appropriately
3. Fix label quality:
   - Identify likely mislabels with confidence scoring
   - Route to human review or automatic correction
   - Document labeling guidelines
4. Normalize and standardize:
   - Consistent formatting
   - Schema validation
   - Encoding standardization
5. Validate the cleaned dataset
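The hash-based exact dedup in step 1 can be sketched with the standard library alone. A minimal sketch, assuming records are JSON-serializable dicts; `dedup_exact` is a hypothetical helper name:

```python
import hashlib
import json

def dedup_exact(records):
    """Hash-based exact deduplication, keeping the first occurrence."""
    seen = set()
    kept = []
    for record in records:
        # Canonical serialization so key order doesn't change the hash
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

Near-duplicate detection needs similarity clustering instead of exact hashing; see the Similarity-Based Deduplication technique below.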
Workflow 3: Augment and Balance

1. Analyze class imbalance:
   - Compute imbalance ratios
   - Assess impact on model training
2. Apply balancing strategies:
   - Oversampling minority classes (SMOTE, random)
   - Undersampling majority classes
   - Class weights in training
3. Generate augmentations:
   - Text: paraphrase, synonym substitution, back-translation
   - Image: rotation, flip, color jitter, mixup
   - Tabular: noise injection, feature perturbation
4. Validate augmentation quality:
   - Ensure augmented samples are realistic
   - Check for introduced biases
5. Version and document changes
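The random oversampling option from step 2 can be sketched like this (a minimal sketch, assuming dict records with a `label` key; `random_oversample` is an illustrative name, not a library function):

```python
import random
from collections import defaultdict

def random_oversample(records, label_key="label", seed=0):
    """Randomly oversample minority classes up to the majority-class count."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label_key]].append(r)

    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw with replacement until this class matches the majority
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced
```

Duplicating minority samples this way is the simplest baseline; SMOTE-style interpolation or class weights avoid repeating identical rows.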
Quick Reference

| Action | Command/Trigger |
| --- | --- |
| Assess quality | "Check quality of this dataset" |
| Find duplicates | "Find duplicates in dataset" |
| Clean labels | "Fix mislabeled data" |
| Balance classes | "Handle class imbalance" |
| Augment data | "Augment dataset for [task]" |
| Version dataset | "Set up dataset versioning" |
Best Practices
Profile Before Processing

- Understand your data before changing it
- Compute statistics and visualize distributions
- Document the original state for reference
- Identify patterns in issues
Preserve Provenance

- Track every transformation
- Version control datasets like code
- Log all cleaning operations
- Maintain a mapping between original and cleaned data
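Versioning datasets like code can start with a simple content-hash manifest, sketched here with the standard library. The `snapshot_manifest` helper and manifest format are assumptions for illustration, not a prescribed tool; dedicated tools like DVC do this at scale:

```python
import hashlib
import json
from pathlib import Path

def snapshot_manifest(data_dir, out_path="manifest.json"):
    """Record a content hash per file so dataset versions are comparable."""
    manifest = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(data_dir))] = digest
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Committing the manifest alongside your code lets you diff two dataset versions file by file, even when the data itself lives outside the repository.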
Prioritize Label Quality

- Garbage labels in, garbage model out
- Invest in clear labeling guidelines
- Use multiple annotators and measure agreement
- Run regular quality audits of labels
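Annotator agreement can be measured with Cohen's kappa, which corrects raw agreement for chance. A minimal two-annotator sketch (the `cohens_kappa` helper is illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's marginal label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in freq_a.keys() | freq_b.keys()
    )
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; values near 0 mean the annotators agree no more often than chance, which usually signals unclear guidelines.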
Test Cleaning Impact

- Measure the effect of cleaning
- Train models on original vs. cleaned data
- Track which cleaning steps help most
- Avoid cleaning that hurts performance
Stratify Splits Carefully

- Maintain distribution in train/val/test
- Stratify by label and key features
- Keep related samples in the same split
- Ensure temporal ordering if applicable
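A label-stratified split can be sketched in plain Python. This is a minimal sketch over hypothetical dict records with a `label` key; note it stratifies by label only and does not keep related samples together, which requires a group-aware split:

```python
import random
from collections import defaultdict

def stratified_split(records, label_key="label", test_frac=0.2, seed=0):
    """Split so each class keeps the same proportion in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label_key]].append(r)

    train, test = [], []
    for rows in by_class.values():
        rows = rows[:]          # shuffle a copy, leave the input intact
        rng.shuffle(rows)
        n_test = round(len(rows) * test_frac)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test
```

For time-ordered data, replace the shuffle with a chronological cut so the test set only contains samples later than the training set.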
Document Everything

- Future you will thank present you
- Dataset cards with key statistics
- Known issues and limitations
- Collection methodology and biases

Advanced Techniques

Confident Learning for Label Noise

Identify and fix mislabeled examples:

```python
from cleanlab.filter import find_label_issues

# Train a model to get predicted probabilities
model.fit(X_train, y_train)
pred_probs = model.predict_proba(X_train)

# Find likely mislabeled examples, ranked by confidence
issues = find_label_issues(
    labels=y_train,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

# Review and correct the top-ranked issues
for idx in issues[:100]:
    review_and_correct(X_train[idx], y_train[idx])
```

Similarity-Based Deduplication

Remove near-duplicates using embeddings:

```python
def deduplicate_semantic(texts, threshold=0.95):
    embeddings = embed(texts)
    clusters = cluster_by_similarity(embeddings, threshold)

    # Keep one representative per cluster
    deduplicated = []
    for cluster in clusters:
        representative = select_best(cluster)  # longest, most recent, etc.
        deduplicated.append(representative)
    return deduplicated
```

Active Learning for Efficient Labeling

Prioritize labeling effort:

```python
def active_learning_loop(unlabeled_pool, labeled_set, budget):
    while len(labeled_set) < budget:
        # Train on the current labeled data
        model.fit(labeled_set)

        # Score unlabeled examples by uncertainty
        uncertainties = model.uncertainty(unlabeled_pool)

        # Select the most uncertain examples for labeling
        to_label = select_top_k(unlabeled_pool, uncertainties, k=10)
        labels = human_label(to_label)

        # Update both sets
        labeled_set.add(to_label, labels)
        unlabeled_pool.remove(to_label)
    return labeled_set
```

Data Slice Analysis

Find problematic subgroups:

```python
def find_weak_slices(model, data, features):
    # Evaluate the model on every slice
    slices = generate_slices(data, features)
    weak_slices = []
    for slice_name, slice_data in slices:
        performance = evaluate(model, slice_data)
        if performance < overall_performance - threshold:
            weak_slices.append({
                "slice": slice_name,
                "size": len(slice_data),
                "performance": performance,
            })
    return sorted(weak_slices, key=lambda x: x["performance"])
```

Common Pitfalls to Avoid

- Cleaning test data the same way as training data (causes leakage)
- Over-aggressive deduplication that removes valid variations
- Imputing values without understanding the missingness mechanism
- Augmenting in ways that introduce unrealistic examples
- Ignoring class imbalance until model training fails
- Not versioning datasets, making experiments irreproducible
- Assuming more data is always better (quality > quantity)
- Failing to document data collection biases and limitations
